Architecture for implementing a virtualization environment and appliance

ABSTRACT

An improved architecture is provided which enables significant convergence of the components of a system to implement virtualization. The infrastructure is VM-aware, and permits scaled out converged storage provisioning to allow storage on a per-VM basis, while identifying I/O coming from each VM. The current approach can scale out from a few nodes to a large number of nodes. In addition, the inventive approach has ground-up integration with all types of storage, including solid-state drives. The architecture of the invention provides high availability against any type of failure, including disk or node failures. In addition, the invention provides high performance by making I/O access local, leveraging solid-state drives and employing a series of patent-pending performance optimizations.

BACKGROUND

A “virtual machine” or a “VM” refers to a specific software-based implementation of a machine in a virtualization environment, in which the hardware resources of a real computer (e.g., CPU, memory, etc.) are virtualized or transformed into the underlying support for the fully functional virtual machine that can run its own operating system and applications on the underlying physical resources just like a real computer.

Virtualization works by inserting a thin layer of software directly on the computer hardware or on a host operating system. This layer of software contains a virtual machine monitor or “hypervisor” that allocates hardware resources dynamically and transparently. Multiple operating systems run concurrently on a single physical computer and share hardware resources with each other. By encapsulating an entire machine, including CPU, memory, operating system, and network devices, a virtual machine is completely compatible with most standard operating systems, applications, and device drivers. Most modern implementations allow several operating systems and applications to safely run at the same time on a single computer, with each having access to the resources it needs when it needs them.

Virtualization allows multiple virtual machines to run on a single physical machine, with each virtual machine sharing the resources of that one physical computer across multiple environments. Different virtual machines can run different operating systems and multiple applications on the same physical computer.

One reason for the broad adoption of virtualization in modern business and computing environments is the resource utilization advantages provided by virtual machines. Without virtualization, if a physical machine is limited to a single dedicated operating system, then during periods of inactivity by the dedicated operating system the physical machine is not utilized to perform useful work. This is wasteful and inefficient if there are users on other physical machines who are currently waiting for computing resources. To address this problem, virtualization allows multiple VMs to share the underlying physical resources so that during periods of inactivity by one VM, other VMs can take advantage of the resource availability to process workloads. This can produce great efficiencies for the utilization of physical devices, and can result in reduced redundancies and better resource cost management.

Many organizations use data centers to implement virtualization, where the data centers are typically architected with traditional servers that communicate with a set of networked storage devices over a network. For example, many data centers are designed using diskless computers (“application servers”) that communicate with a set of networked storage appliances (“storage servers”) via a network, such as a Fibre Channel or Ethernet network.

The problem is that this traditional approach cannot adapt to the modern demands of virtualization, which is particularly problematic with respect to the way these traditional architectures manage storage. One reason is that the traditional network storage-based architecture is designed for physical servers that serve relatively static workloads, and is not flexible or adaptable enough to adequately handle the dynamic nature of storage and virtual machines that, in a virtualization or cloud computing environment, may be created or moved on the fly from one network location to another.

Moreover, the traditional approach relies upon very large and specialized rackmount or freestanding compute and storage devices that are managed by a central storage manager. This approach does not scale very well, since the central storage manager becomes a very significant performance bottleneck as the number of storage devices increases. Moreover, the traditional compute and storage devices are expensive to purchase, maintain, and power, and are large enough to require a significant investment just in terms of the amount of physical space that is needed to implement the data center.

Given these challenges with the traditional data center architectures, it has become clear that the conventional approaches to implementing a data center for virtualization present excessive levels of cost and complexity, while being very ill-adapted to the needs of modern virtualization systems. These problems are further exacerbated by the fact that data volumes are constantly growing at a rapid pace in the modern data center, thanks to the ease of creating new VMs. In the enterprise, new initiatives like desktop virtualization contribute to this trend of increased data volumes. This growing pool of VMs is exerting tremendous cost, performance and manageability pressure on the traditional architecture that connects compute to storage over a multi-hop network.

Therefore, there is a need for an improved approach to implement an architecture for a virtualization data center.

SUMMARY

Embodiments of the present invention provide an improved architecture which enables significant convergence of the components of a system to implement virtualization. The infrastructure is VM-aware, and permits scale-out converged storage (SOCS) provisioning to allow storage on a per-VM basis, while identifying I/O coming from each VM. The current approach can scale out from a few nodes to a large number of nodes. In addition, the inventive approach has ground-up integration with all types of storage, including solid-state drives. The architecture of the invention provides high availability against any type of failure, including disk or node failures. In addition, the invention provides high performance by making I/O access local, leveraging solid-state drives and employing a series of patent-pending performance optimizations.

Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims. Both the foregoing general description and the following detailed description are exemplary and explanatory, and are not intended to be limiting as to the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings illustrate the design and utility of embodiments of the present invention, in which similar elements are referred to by common reference numerals. In order to better appreciate the advantages and objects of embodiments of the invention, reference should be made to the accompanying drawings. However, the drawings depict only certain embodiments of the invention, and should not be taken as limiting the scope of the invention.

FIG. 1 illustrates an example architecture to implement virtualization according to some embodiments of the invention.

FIG. 2 illustrates an example block to implement virtualization according to some embodiments of the invention.

FIG. 3 illustrates an example architecture to implement I/O and storage device management in a virtualization environment according to some embodiments of the invention.

FIGS. 4A-D illustrate example designs for a block according to some embodiments of the invention.

FIG. 5 illustrates a rack of blocks according to some embodiments of the invention.

FIG. 6 illustrates a network configuration according to some embodiments of the invention.

FIGS. 7A-C show alternate approaches to implement I/O requests according to some embodiments of the invention.

FIG. 8 illustrates a storage hierarchy according to some embodiments of the invention.

FIG. 9 illustrates the components of a Controller VM according to some embodiments of the invention.

FIG. 10 illustrates shared vDisks according to some embodiments of the invention.

FIG. 11 illustrates shared-nothing vDisks according to some embodiments of the invention.

FIG. 12 is a block diagram of a computing system suitable for implementing an embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS OF THE INVENTION

Embodiments of the present invention provide an improved approach to implement virtualization appliances for a datacenter which address and correct the problems of the prior art. According to some embodiments, the present invention provides a scalable compute and storage infrastructure that effectively and efficiently allows organizations to virtualize their data centers. The virtualization appliance of the present invention provides complete compute and storage capabilities along with performance, scalability, availability and data management features. In some embodiments, the virtualization appliances leverage industry-standard hardware components and advanced storage management software to provide an out-of-the-box solution that makes virtualization extremely easy and cost effective.

FIG. 1 shows an integrated collection (or “cluster”) 100 of virtualization appliances or “blocks” 102 a, 102 b, 102 c, and 102 d. Each of the blocks includes hardware and software to implement a virtualization solution. For example, block 102 b is internally organized to include hardware and software to implement multiple virtualization nodes. Each node runs a standard hypervisor on hardware that contains processors, memory and local storage, such as a mix of SSDs and/or hard disk drives. Each node runs virtual machines just like a standard virtual machine host.

In addition, local storage from all nodes is virtualized into a unified storage pool, which is referred to herein as “scale-out converged storage” or “SOCS” 155. As described in more detail below, SOCS 155 acts like an advanced SAN that uses local SSDs and disks from all nodes to store virtual machine data. Virtual machines running on the cluster write data to SOCS as if they were writing to a SAN. SOCS is VM-aware and provides advanced data management features. This approach brings the data closer to virtual machines by storing the data locally on the system (if desired), resulting in higher performance at a lower cost. As discussed in more detail below, this solution can horizontally scale from a few nodes to a large number of nodes, enabling organizations to scale their infrastructure as their needs grow.

While traditional SAN solutions typically have 1, 2, 4 or 8 controllers, an n-node system according to the present embodiment has n controllers. Every node in the cluster runs a special virtual machine, called a Controller VM (or “service VM”), which acts as a virtual controller for SOCS. All Controller VMs in the cluster communicate with each other to form a single distributed system. Unlike traditional SAN/NAS solutions that are limited to a small number of fixed controllers, this architecture continues to scale as more nodes are added.

As stated above, each block includes a sufficient collection of hardware and software to provide a self-contained virtualization appliance, e.g., as shown in FIG. 2. The example block 200 in FIG. 2 includes four nodes 1-4. Having the multiple nodes within a block allows both high performance and reliability. Performance is increased since there are multiple independent nodes to handle the virtualization needs of the system. Reliability is improved since the multiple nodes provide for redundancy in the event of a possible hardware or software error. Moreover, as discussed below, the software-based storage management solution allows for easy movement of data as the storage needs of the system change.

Each node in the block includes both hardware components 202 and software components 204 to implement virtualization. Hardware components 202 include processing capacity (e.g., using one or more processors) and memory capacity (e.g., random access memory or RAM) on a motherboard 203. The node also comprises local storage 222, which in some embodiments includes Solid State Drives (henceforth “SSDs”) 125 and/or Hard Disk Drives (henceforth “HDDs” or “spindle drives”) 127. Any combination of SSDs and HDDs may be used to implement the local storage 222.

The software 204 includes a hypervisor 230 to manage the interactions between the underlying hardware 202 and the one or more user VMs 202 a and 202 b that run client software. A controller VM 210 a exists on each node to implement distributed storage management of the local storage 222, such that the collected local storage for all nodes can be managed as a combined SOCS.

FIG. 3 illustrates an approach for implementing SOCS-based storage management in a virtualization environment according to some embodiments of the invention. The architecture of FIG. 3 can be implemented for a distributed platform that contains multiple nodes/servers 300 a and 300 b that manage multiple tiers of storage. The nodes 300 a and 300 b may be within the same block, or on different blocks in a clustered environment of multiple blocks. The multiple tiers of storage include storage that is accessible through a network 340, such as cloud storage 326 or networked storage 328 (e.g., a SAN or “storage area network”). In addition, the present embodiment also permits local storage 322/324 that is within or directly attached to the server and/or appliance to be managed as part of the storage pool 360. As noted above, examples of such storage include any combination of SSDs 325 and/or HDDs 327. These collected storage devices, both local and networked, form a storage pool 360.

Virtual disks (or “vDisks”) can be structured from the storage devices in the storage pool 360, as described in more detail below. As used herein, the term vDisk refers to the storage abstraction that is exposed by a Controller VM to be used by a user VM. In some embodiments, the vDisk is exposed via iSCSI (“internet small computer system interface”) or NFS (“network file system”) and is mounted as a virtual disk on the user VM. Each server 300 a or 300 b runs virtualization software, such as VMware ESX(i), Microsoft Hyper-V, or RedHat KVM. The virtualization software includes a hypervisor 330/332 to manage the interactions between the underlying hardware and the one or more user VMs 302 a, 302 b, 302 c, and 302 d that run client software.

Controller VMs 310 a/310 b (also referred to herein as “service VMs”) are used to manage storage and I/O activities. This is the distributed “Storage Controller” in the currently described architecture. Multiple such storage controllers coordinate within a cluster to form a single system. The Controller VMs 310 a/310 b are not formed as part of specific implementations of hypervisors 330/332. Instead, the Controller VMs run as virtual machines above hypervisors 330/332 on the various nodes/servers 300 a and 300 b, and work together to form a distributed system 310 that manages all the storage resources, including the locally attached storage 322/324, the networked storage 328, and the cloud storage 326. Because the Controller VMs run above the hypervisors 330/332, the current approach can be used and implemented within any virtual machine architecture: the Controller VMs of embodiments of the invention can be used in conjunction with any hypervisor from any virtualization vendor.

Each Controller VM 310 a-b exports one or more block devices or NFS server targets that appear as disks to the client VMs 302 a-d. These disks are virtual, since they are implemented by the software running inside the Controller VMs 310 a-b. Thus, to the user VMs 302 a-d, the Controller VMs 310 a-b appear to be exporting a clustered storage appliance that contains some disks. All user data (including the operating system) in the client VMs 302 a-d resides on these virtual disks.

Significant performance advantages can be gained by allowing the virtualization system to access and utilize local (e.g., server-internal) storage 322 as disclosed herein. This is because I/O performance is typically much faster when performing access to local storage 322 as compared to performing access to networked storage 328 across a network 340. This faster performance for locally attached storage 322 can be increased even further by using certain types of optimized local storage devices, such as SSDs 325. Once the virtualization system is capable of managing and accessing locally attached storage, as is the case with the present embodiment, various optimizations can then be implemented to improve system performance even further. For example, the data to be stored in the various storage devices can be analyzed and categorized to determine which specific device should optimally be used to store the items of data. Data that needs to be accessed much faster or more frequently can be identified for storage in the locally attached storage 322. On the other hand, data that does not require fast access or which is accessed infrequently can be stored in the networked storage devices 328 or in cloud storage 326. In addition, the performance of the local storage can be further improved by changing the mix of SSDs and HDDs within the local storage, e.g., by increasing or decreasing the proportion of SSDs to HDDs in the local storage.
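
By way of illustration only, the categorization of data described above might be sketched as a simple placement rule. The access-rate thresholds, the tier labels, and the split of local storage into an SSD tier and an HDD tier are assumptions made for this sketch; the description does not specify any of them.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the description does not give concrete values.
HOT_ACCESSES_PER_HOUR = 100
WARM_ACCESSES_PER_HOUR = 1


@dataclass
class DataItem:
    name: str
    accesses_per_hour: float
    needs_low_latency: bool = False


def choose_tier(item: DataItem) -> str:
    """Return an illustrative placement tier for one item of data."""
    # Data needing fast or frequent access goes to locally attached storage.
    if item.needs_low_latency or item.accesses_per_hour >= HOT_ACCESSES_PER_HOUR:
        return "local-ssd"
    # Moderately used data stays local but on slower media (an assumption).
    if item.accesses_per_hour >= WARM_ACCESSES_PER_HOUR:
        return "local-hdd"
    # Infrequently accessed data can live in networked or cloud storage.
    return "networked-or-cloud"
```

For example, an item read several hundred times per hour would be placed on the local SSD tier under these assumed thresholds, while an archive touched a few times a year would be sent to networked or cloud storage.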

The present architecture solves storage challenges for virtual machines by providing a general-purpose scale-out compute and storage infrastructure that eliminates the need for network storage. In part, this is due to the distributed nature of the storage controller infrastructure that utilizes Controller VMs to act as virtual controllers for SOCS. Since all the Controller VMs in the cluster communicate with each other to form a single distributed system, this eliminates the limitations and performance bottlenecks associated with traditional SAN solutions that typically have only 1, 2, 4 or 8 controllers. Therefore, n-node clusters will essentially have n controllers, providing a solution that will easily scale to very large data volumes.

In addition, the solution will very effectively support virtualization and hypervisor functions, within a single virtualization appliance (block) that can be extensively combined with other blocks to support large scale virtualization needs. Since the architecture is VM-aware, it overcomes limitations of traditional solutions that were optimized to work only with physical servers. For example, the present approach overcomes limitations associated with the traditional unit of management for storage pertaining to LUNs, where if a LUN is shared by many VMs, it becomes more difficult to perform storage operations such as backup, recovery, and snapshots on a per-VM basis. It is also difficult to identify performance bottlenecks in a heavily-shared environment due to the chasm between computing and storage tiers. The current architecture overcomes these limitations since the storage units (vDisks) are managed across an entire virtual storage space.

Moreover, the present approach can effectively take advantage of enterprise-grade solid-state drives (SSDs). Traditional storage systems were designed for spinning media and it is therefore difficult for these traditional systems to leverage SSDs efficiently due to the entirely different access patterns that SSDs provide. While hard disks have to deal with the rotation and seek latencies, SSDs do not have such mechanical limitations. This difference between the two media requires the software to be optimized differently for performance. One cannot simply take software written for hard disk-based systems and hope to use it efficiently on solid-state drives. The present architecture can use any type of storage media, including SSDs, and can use SSDs to store a variety of frequently-accessed data, from VM metadata to primary data storage, both in a distributed cache for high-performance and in persistent storage for quick retrieval.

In some embodiments, to maximize the performance benefits of using SSDs, the present architecture reserves SSDs for I/O-intensive functions and includes space-saving techniques that allow large amounts of logical data to be stored in a small physical space. In addition, the present approach can be used to migrate “cold” or infrequently-used data to hard disk drives automatically, allowing administrators to bypass SSDs for low-priority VMs.

The present architecture therefore provides a solution that enables significant convergence of the storage components of the system with the compute components, allowing VMs and SOCS to co-exist within the same cluster. From a hardware perspective, each block provides a “building block” to implement an expandable unit of virtualization, which is both self-contained and expandable to provide a solution for any sized requirements.

FIGS. 4A-D illustrate a block 400 according to some embodiments of the invention. As shown in FIG. 4A, each block 400 can be mounted on a rack with other blocks, and can be instantiated in a chassis that holds multiple nodes A, B, C, and D. Each node corresponds to a serverboard 406 that is insertable into the chassis and which contains one or more processors, memory components, and other hardware components typically found on a motherboard. FIG. 4A shows a perspective view of the block 400, showing the serverboards of the nodes in a partially inserted position. FIG. 4B shows an end view of the block 400, illustrating the arrangement of the nodes/serverboards 406 in a fully inserted position within the chassis. FIG. 4C illustrates an end view of a single serverboard 406, showing the various connection points for wiring and peripherals for the serverboard, e.g., network connectors.

Each of the serverboards 406 acts as a separate node within the block 400. As independent nodes, each node may be powered on or off separately without affecting the others. In addition, the serverboards 406 are hot swappable and may be removed from the end of the chassis without affecting the operation of the other serverboards. This configuration of multiple nodes ensures hardware-based redundancy of processing and storage capacity for the block, with the storage management software providing for operational redundancies of the data stored and managed by the block.

The block 400 also includes multiple power supply modules 408, e.g., two separate modules as shown in FIG. 4B. This provides for power redundancy, so that failure of a single power supply module will not bring down the whole block 400.

The block 400 supports multiple local storage devices. In some embodiments, the block 400 includes a backplane that allows connection of six SAS or SATA storage units to each node, for a total of 24 storage units 404 for the block 400. Any suitable type or configuration of storage unit may be connected to the backplane, such as SSDs or HDDs. In some embodiments, any combination of SSDs and HDDs can be implemented to form the six storage units for each node, including all SSDs, all HDDs, or a mixture of SSDs and HDDs.
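
The storage-unit count above follows from four nodes with six units each. The short sketch below merely restates that arithmetic and enumerates the possible SSD/HDD splits for one node; the function names are hypothetical and chosen only for illustration.

```python
NODES_PER_BLOCK = 4   # the example block holds nodes A, B, C, and D
UNITS_PER_NODE = 6    # SAS or SATA storage units connected to each node


def units_per_block(nodes: int = NODES_PER_BLOCK,
                    per_node: int = UNITS_PER_NODE) -> int:
    """Total storage units in one block."""
    return nodes * per_node


def node_mixes(per_node: int = UNITS_PER_NODE) -> list[tuple[int, int]]:
    """Every (ssd_count, hdd_count) split for a single node."""
    return [(ssd, per_node - ssd) for ssd in range(per_node + 1)]
```

With the figures given in the text, `units_per_block()` yields the 24 storage units stated above, and `node_mixes()` ranges from all HDDs `(0, 6)` to all SSDs `(6, 0)`.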

FIG. 4D shows an end view of the portion of the block/chassis corresponding to the storage units. Each of the individual storage units 404 is insertable into the chassis of block 400. In addition, the storage units are hot swappable and may be removed from the end of the chassis without affecting the operation of the other storage units. This configuration of multiple storage units ensures hardware-based redundancy of the storage capacity for the block.

The entirety of the block 400 fits within a “2u” or less form factor unit. A rack unit or “u” (also referred to as a “RU”) is a unit of measure used to describe the height of equipment intended for mounting in a rack system. In some embodiments, one rack unit is 1.75 inches (44.45 mm) high. This means that the 2u or less block provides a very space-efficient and power-efficient building block for implementing a virtualized data center. The redundancies that are built into the block mean that there is no single point of failure that exists for the unit. The redundancies also mean that there is no single point of bottleneck for the performance of the unit.

The blocks are rackable as well, with the block being mountable on a standard 19″ rack. FIG. 5 illustrates a cluster 500 of blocks/nodes on a rack that demonstrates the linear scalability of the present architecture from four nodes (one Block) to a much larger number of nodes. Multiple blocks may be placed on the same rack, interconnected using a networking component that also resides on the rack. FIG. 6 illustrates an example networking configuration 600 that can be used for multiple pods of blocks to scale to any number of desired virtualization capabilities. In this approach, one or more physical switches are used to interconnect the components on the rack component(s). However, communications are actually fulfilled using a virtual switch technology to address the virtualization components. This allows the components to share a small number (e.g., one) of “fat” pipes to handle the messaging traffic.
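
As a rough worked example of the form-factor figures above, the following sketch computes the height of a 2u block and the number of nodes that could share one rack. The 42u rack height and the `reserved_u` parameter are assumptions introduced for illustration; the description does not state a rack height.

```python
RACK_UNIT_INCHES = 1.75
RACK_UNIT_MM = 44.45
BLOCK_HEIGHT_U = 2            # the block fits within a 2u form factor
NODES_PER_BLOCK = 4           # four nodes per block in the example design
ASSUMED_RACK_HEIGHT_U = 42    # assumption: not specified in the description


def block_height() -> tuple[float, float]:
    """Height of one block as (inches, millimetres)."""
    return BLOCK_HEIGHT_U * RACK_UNIT_INCHES, BLOCK_HEIGHT_U * RACK_UNIT_MM


def nodes_per_rack(rack_height_u: int = ASSUMED_RACK_HEIGHT_U,
                   reserved_u: int = 0) -> int:
    """Nodes that fit in a rack after reserving space for other equipment."""
    usable_u = rack_height_u - reserved_u
    return (usable_u // BLOCK_HEIGHT_U) * NODES_PER_BLOCK
```

Under these assumptions a block is 3.5 inches high, and a fully populated rack would hold 84 nodes, or 80 if two rack units are set aside for other equipment.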

FIG. 7A illustrates an example approach that can be taken in some embodiments of the invention to use virtual switches to communicate to/from the Controller VMs 710 a/710 b on the different nodes. In this approach, the user VM 702 sends I/O requests 750 a to the Controller VMs in the form of iSCSI or NFS requests. The term “iSCSI” or “Internet Small Computer System Interface” refers to an IP-based storage networking standard for linking data storage facilities together. By carrying SCSI commands over IP networks, iSCSI can be used to facilitate data transfers over intranets and to manage storage over any suitable type of network or the Internet. The iSCSI protocol allows iSCSI initiators to send SCSI commands to iSCSI targets at remote locations over a network. In another embodiment of the invention, the user VM 702 sends I/O requests 750 b to the Controller VMs in the form of NFS requests. The term “NFS” or “Network File System” interface refers to an IP-based file access standard in which NFS clients send file-based requests to NFS servers via a proxy folder (directory) called a “mount point”. Going forward, this disclosure will interchangeably use the terms iSCSI and NFS to refer to the IP-based protocol used to communicate between the hypervisor and the Controller VM. Note that while both protocols are network-based, the currently described architecture makes it possible to use them over the virtual network within the hypervisor. No iSCSI or NFS packets will need to leave the machine, because the communication (both the request and the response) begins and ends within the single hypervisor host.

Here, the user VM 702 structures its I/O requests into the iSCSI format. The iSCSI or NFS request 750 a designates the IP address for a Controller VM from which the user VM 702 desires I/O services. The iSCSI or NFS request 750 a is sent from the user VM 702 to a virtual switch 752 within the hypervisor to be routed to the correct destination. If the request is intended to be handled by the Controller VM 710 a within the same server 700 a, then the iSCSI or NFS request 750 a is internally routed within server 700 a to the Controller VM 710 a. As described in more detail below, the Controller VM 710 a includes structures to properly interpret and process that request 750 a.

It is also possible that the iSCSI or NFS request 750 a will be handled by a Controller VM 710 b on another server 700 b. In this situation, the iSCSI or NFS request 750 a will be sent by the virtual switch 752 to a real physical switch to be sent across network 740 to the other server 700 b. The virtual switch 755 within the hypervisor 733 on the server 700 b will then route the request 750 a to the Controller VM 710 b for further processing.
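
The two routing cases just described (a Controller VM on the same server versus one on another server) can be sketched as follows. This is a simplified model, not the disclosed implementation: the `Request` type, the hop labels, and the example IP addresses and server names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    source_server: str   # server hosting the user VM that issued the request
    target_ip: str       # IP address of the desired Controller VM


def route(request: Request, controller_ips: dict[str, str]) -> list[str]:
    """Return the hops a request takes.

    controller_ips maps a Controller VM IP address to the server hosting it.
    """
    target_server = controller_ips[request.target_ip]
    hops = [f"virtual-switch@{request.source_server}"]
    if target_server != request.source_server:
        # Remote case: leave the host through a physical switch, then enter
        # the virtual switch of the destination server.
        hops.append("physical-switch")
        hops.append(f"virtual-switch@{target_server}")
    hops.append(f"controller-vm@{target_server}")
    return hops
```

For instance, with `controller_ips = {"10.0.0.1": "700a", "10.0.0.2": "700b"}`, a request from server `700a` to `10.0.0.1` never leaves the host, whereas a request to `10.0.0.2` crosses the physical switch before reaching the Controller VM on `700b`.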

FIG. 7B illustrates an alternate approach in which the I/O request from the user VM 702 is in the normal SCSI protocol to a storage device. The hypervisor then converts this SCSI request into an iSCSI or an NFS request as part of its hardware emulation layer. In other words, the virtual SCSI disk attached to the user VM is either an iSCSI LUN or an NFS file in an NFS server. In this approach, an iSCSI initiator 772 or the NFS client software is employed to convert the SCSI-formatted requests into the appropriate iSCSI- or NFS-formatted requests that can be handled by the Controller VM 710 a. The advantage of this approach over the approach of FIG. 7A is that there is no need to individually reconfigure or make sure that the software for the user VMs 702 can work with the iSCSI or NFS protocol.

According to some embodiments, the controller VM runs the Linux operating system. As noted above, since the controller VM exports a block-device or file-access interface to the user VMs, the interaction between the user VMs and the controller VMs follows the iSCSI or NFS protocol, either directly or indirectly via the hypervisor's hardware emulation layer.

For easy management of the appliance, the Controller VMs all have the same IP address isolated by internal VLANs (virtual LANs in the virtual switch of the hypervisor). FIG. 7C illustrates this aspect of the architecture. The Controller VM 710 a on node 700 a implements two virtual network interface cards (NICs) 761 a and 761 b. One of the virtual NICs 761 a corresponds to an internal VLAN that permits the User VM 702 to communicate with the Controller VM 710 a using the common IP address. The virtual switch 760 would therefore route all communications internal to the node 700 a between the User VM 702 and the Controller VM 710 a using the first virtual NIC 761 a, where the common IP address is managed to correspond to the Controller VM 710 a due to its membership in the appropriate VLAN.

The second virtual NIC 761 b is used to communicate with entities external to the node 700 a, where the virtual NIC 761 b is associated with an IP address that would be specific to Controller VM 710 a (and no other controller VM). The second virtual NIC 761 b is therefore used to allow Controller VM 710 a to communicate with other controller VMs, such as Controller VM 710 b on node 700 b. It is noted that Controller VM 710 b would likewise utilize VLANs and multiple virtual NICs 763 a and 763 b to implement management of the appliance.

For easy management of the appliance, the storage is divided up into abstractions that have a hierarchical relationship to each other. FIG. 8 illustrates the storage hierarchy of the storage objects according to some embodiments of the invention, where all storage in the storage appliance collectively forms a Storage Universe. These storage devices may encompass any suitable devices, such as SSDs, HDDs on the various servers (“server-internal” or local storage), SAN, and Cloud storage.

Storage with similar characteristics is classified into tiers. Thus, all SSDs can be classified into a first tier and all HDDs may be classified into another tier etc. In a heterogeneous system with different kinds of HDDs, one may classify the disks into multiple HDD tiers. This action may similarly be taken for SAN and cloud storage.
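
The addressing scheme of FIG. 7C, in which every Controller VM shares one internal address and holds one unique external address, can be captured in a small consistency check. The sketch below is illustrative only; the class, the function name, and the IP addresses are hypothetical and do not come from the description.

```python
from dataclasses import dataclass

# Hypothetical common address; each node's internal VLAN keeps it isolated.
COMMON_INTERNAL_IP = "192.168.5.2"


@dataclass(frozen=True)
class ControllerVM:
    node: str
    external_ip: str                        # second virtual NIC, unique per VM
    internal_ip: str = COMMON_INTERNAL_IP   # first virtual NIC, shared value


def addressing_is_consistent(controllers: list[ControllerVM]) -> bool:
    """Check one shared internal IP and pairwise-distinct external IPs."""
    internal = {c.internal_ip for c in controllers}
    external = [c.external_ip for c in controllers]
    return len(internal) == 1 and len(set(external)) == len(external)
```

A user VM on any node can then be configured with the same Controller VM address, while the Controller VMs still reach one another through their distinct external addresses.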

The storage universe is divided up into storage pools, each essentially a collection of specific storage devices. An administrator may be responsible for deciding how to divide up the storage universe into storage pools. For example, an administrator may decide to just make one storage pool with all the disks in the storage universe in that pool. However, the principal idea behind dividing up the storage universe is to provide mutual exclusion (fault isolation, performance isolation, and administrative autonomy) when accessing the disk resources.

This may be one approach that can be taken to implement QoS techniques. For example, one rogue user may generate an excessive amount of random IO activity on a hard disk; if other users are doing sequential IO on that disk, they may still be hurt by the rogue user. Enforcing exclusion (isolation) through storage pools might be used to provide hard guarantees for premium users. Another reason to use a storage pool might be to reserve some disks for later use (field replaceable units, or “FRUs”).
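
One way to picture the hierarchy described above (a storage universe, tiers of like devices, and administrator-defined pools) is the sketch below. It assumes that pools never share a device, which is one reading of the mutual exclusion mentioned above; the type and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_id: str
    kind: str   # e.g. "ssd", "hdd", "san", "cloud"


def group_by_tier(universe: list[Device]) -> dict[str, list[str]]:
    """Classify devices with similar characteristics into tiers."""
    tiers: dict[str, list[str]] = defaultdict(list)
    for device in universe:
        tiers[device.kind].append(device.device_id)
    return dict(tiers)


def unassigned_devices(universe: list[Device],
                       pools: dict[str, list[str]]) -> set[str]:
    """Validate that pools are disjoint; return devices in no pool."""
    known = {d.device_id for d in universe}
    seen: set[str] = set()
    for pool_name, members in pools.items():
        for device_id in members:
            if device_id not in known:
                raise ValueError(f"{device_id} is not in the storage universe")
            if device_id in seen:
                raise ValueError(f"{device_id} appears in more than one pool")
            seen.add(device_id)
    return known - seen
```

In this model a premium pool and a general pool would be given disjoint device lists, so heavy random IO in one pool cannot land on the disks of the other.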

As noted above, the Controller VM is the primary software component within the server that virtualizes I/O access to hardware resources within a storage pool according to embodiments of the invention. This approach essentially provides for a separate and dedicated controller for each and every node within a virtualized data center (a cluster of nodes that run some flavor of hypervisor virtualization software), since each node will include its own Controller VM. This is in contrast to conventional storage architectures that provide for a limited number of storage controllers (e.g., four controllers) to handle the storage workload for the entire system, and hence result in significant performance bottlenecks due to the limited number of controllers. Unlike the conventional approaches, each new node will include a Controller VM to share in the overall workload of the system to handle storage tasks. Therefore, the current approach is infinitely scalable, and provides a significant advantage over the conventional approaches that have a limited storage processing power. Consequently, the currently described approach creates a massively-parallel storage architecture that scales as and when hypervisor hosts are added to a datacenter.

FIG. 9 illustrates the internal structures of a Controller VM according to some embodiments of the invention. As previously noted, the Controller VMs are not formed as part of specific implementations of hypervisors. Instead, the Controller VMs run as virtual machines above hypervisors on the various nodes. Since the Controller VMs run above the hypervisors, this means that the current approach can be used and implemented within any virtual machine architecture, since the Controller VMs of embodiments of the invention can be used in conjunction with any hypervisor from any virtualization vendor. Therefore, the Controller VM can be configured to operate ubiquitously anywhere within the computing environment, and will not need to be custom-configured for each different type of operating environment. This is particularly useful because the industry-standard iSCSI or NFS protocols allow the Controller VM to be hypervisor-agnostic.

The main entry point into the Controller VM is the central controller module 804 (which is referred to here as the “I/O Director module 804”). The term I/O Director module is used to connote the fact that this component directs the I/O from the world of virtual disks to the pool of physical storage resources. In some embodiments, the I/O Director module implements the iSCSI or NFS protocol server.

A write request originating at a user VM would be sent to the iSCSI or NFS target inside the controller VM's kernel. This write would be intercepted by the I/O Director module 804 running in user space. I/O Director module 804 interprets the iSCSI LUN or the NFS file destination and converts the request into an internal “vDisk” request (e.g., as described in more detail below). Ultimately, the I/O Director module 804 would write the data to the physical storage.

Each vDisk managed by a Controller VM corresponds to a virtual address space forming the individual bytes exposed as a disk to user VMs. Thus, if the vDisk is of size 1 TB, the corresponding address space maintained by the invention is 1 TB. This address space is broken up into equal sized units called vDisk blocks. Metadata 810 is maintained by the Controller VM to track and handle the vDisks and the data and storage objects in the system that pertain to the vDisks. The Metadata 810 is used to track and maintain the contents of the vDisks and vDisk blocks.
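As an illustration of how a byte offset in the vDisk address space might map onto equal sized vDisk blocks, consider the following Python sketch. The block size of 1 MiB and the function names `locate` and `blocks_touched` are assumptions chosen for the example; the specification does not state a block size.

```python
BLOCK_SIZE = 1024 * 1024  # assumed vDisk block size (1 MiB); not given in the text


def locate(offset, vdisk_size, block_size=BLOCK_SIZE):
    """Return (block index, offset within block) for a byte offset in a vDisk."""
    if not 0 <= offset < vdisk_size:
        raise ValueError("offset lies outside the vDisk address space")
    return divmod(offset, block_size)


def blocks_touched(offset, length, vdisk_size, block_size=BLOCK_SIZE):
    """Return the range of block indices covered by an I/O of `length` bytes."""
    if length <= 0 or offset < 0 or offset + length > vdisk_size:
        raise ValueError("I/O range lies outside the vDisk address space")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)
```

An I/O that straddles a block boundary touches more than one block, which is why the second helper returns a range rather than a single index.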

In order to determine where to write and read data from the storage pool, the I/O Director module 804 communicates with a Distributed Metadata Service module 830 that maintains all the metadata 810. In some embodiments, the Distributed Metadata Service module 830 is a highly available, fault-tolerant distributed service that runs on all the Controller VMs in the appliance. The metadata managed by Distributed Metadata Service module 830 is itself kept on the persistent storage attached to the appliance. According to some embodiments of the invention, the Distributed Metadata Service module 830 may be implemented on SSD storage.

Since requests to the Distributed Metadata Service module 830 may be random in nature, SSDs can be used on each server node to maintain the metadata for the Distributed Metadata Service module 830. The Distributed Metadata Service module 830 stores the metadata that helps locate the actual content of each vDisk block. If no information is found in Distributed Metadata Service module 830 corresponding to a vDisk block, then that vDisk block is assumed to be filled with zeros. The data in each vDisk block is physically stored on disk in contiguous units called extents. Extents may vary in size when de-duplication is being used. Otherwise, an extent size coincides with a vDisk block. Several extents are grouped together into a unit called an extent group. An extent group is then stored as a file on disk. The size of each extent group is anywhere from 16 MB to 64 MB. In some embodiments, an extent group is the unit of recovery, replication, and many other storage functions within the system.
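The lookup rule described above, in which a vDisk block with no metadata entry is treated as zero-filled, can be sketched as follows. This is a single-process illustration only: the class `MetadataService`, the `(extent_group_id, extent_id)` location tuple, and the in-memory `extent_store` dictionary are hypothetical stand-ins, not the distributed service of the specification.

```python
class MetadataService:
    """Maps (vdisk_id, block_index) to an extent location (hypothetical model)."""

    def __init__(self):
        self._locations = {}

    def put(self, vdisk_id, block_index, extent_group_id, extent_id):
        self._locations[(vdisk_id, block_index)] = (extent_group_id, extent_id)

    def lookup(self, vdisk_id, block_index):
        # Returns None when no metadata exists for the block.
        return self._locations.get((vdisk_id, block_index))


def read_block(metadata, extent_store, vdisk_id, block_index, block_size):
    """Read one vDisk block; a block without metadata reads as zeros."""
    location = metadata.lookup(vdisk_id, block_index)
    if location is None:
        return bytes(block_size)  # zero-filled block
    return extent_store[location]


def write_block(metadata, extent_store, vdisk_id, block_index, location, data):
    """Store the data as an extent and record where the block now lives."""
    extent_store[location] = data
    extent_group_id, extent_id = location
    metadata.put(vdisk_id, block_index, extent_group_id, extent_id)
```

One consequence of this rule, visible in the sketch, is that a freshly created vDisk needs no extents at all until it is first written.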

Further details regarding methods and mechanisms for implementing a Controller VM are described below and in co-pending application Ser. No. 13/207,345, filed on Aug. 10, 2011, which is hereby incorporated by reference in its entirety. Further details regarding methods and mechanisms for implementing Metadata 810 are described below and in co-pending application Ser. No. 13/207,357, filed on Aug. 10, 2011, which is hereby incorporated by reference in its entirety.

A health management module 808 (which may hereinafter be referred to as a “Curator”) is employed to address and cure any inconsistencies that may occur with the Metadata 810. The Curator 808 oversees the overall state of the virtual storage system, and takes actions as necessary to manage the health and efficient performance of that system. According to some embodiments of the invention, the Curator 808 operates on a distributed basis to manage and perform these functions, where a master curator on a first server node manages the workload that is performed by multiple slave curators on other server nodes. MapReduce operations are performed to implement the curator workload, where the master curator may periodically coordinate scans of the metadata in the system to manage the health of the distributed storage system. Further details regarding methods and mechanisms for implementing Curator 808 are disclosed in co-pending application Ser. No. 13/207,365, filed on Aug. 10, 2011, which is hereby incorporated by reference in its entirety.

Some of the Controller VMs also include a Distributed Configuration Database module 806 to handle certain administrative tasks. The primary tasks performed by the Distributed Configuration Database module 806 are to maintain configuration data 812 for the Controller VM and act as a notification service for all events in the distributed system. Examples of configuration data 812 include (1) the identity and existence of vDisks; (2) the identity of Controller VMs in the system; (3) the physical nodes in the system; and (4) the physical storage devices in the system. For example, assume that there is a desire to add a new physical disk to the storage pool. The Distributed Configuration Database module 806 would be informed of the new physical disk, after which the configuration data 812 is updated to reflect this information so that all other entities in the system can then be made aware of the new physical disk. In a similar way, the addition/deletion of vDisks, VMs and nodes would be handled by the Distributed Configuration Database module 806 to update the configuration data 812 so that other entities in the system can be made aware of these configuration changes.
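The combination of configuration storage and event notification described above can be sketched as a small publish/subscribe structure. The following Python example is a single-process illustration under stated assumptions: the class `ConfigDatabase`, its category names, and the callback signature are hypothetical, and a real embodiment would be distributed and fault tolerant.

```python
class ConfigDatabase:
    """Holds configuration data and notifies subscribers of changes (sketch)."""

    def __init__(self):
        # The four categories follow the examples of configuration data in the text.
        self._data = {
            "vdisks": set(),
            "controller_vms": set(),
            "nodes": set(),
            "disks": set(),
        }
        self._subscribers = []

    def subscribe(self, callback):
        """Register callback(event, kind, identity) for every change."""
        self._subscribers.append(callback)

    def add(self, kind, identity):
        self._data[kind].add(identity)
        self._notify("added", kind, identity)

    def remove(self, kind, identity):
        self._data[kind].discard(identity)
        self._notify("removed", kind, identity)

    def snapshot(self):
        return {kind: sorted(items) for kind, items in self._data.items()}

    def _notify(self, event, kind, identity):
        for callback in self._subscribers:
            callback(event, kind, identity)
```

Adding a new physical disk then becomes a single `add("disks", ...)` call, after which every subscribed entity learns of the disk through its callback.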

Another task that is handled by the Distributed Configuration Database module 806 is to maintain health information for entities in the system, such as the Controller VMs. If a Controller VM fails or otherwise becomes unavailable, then this module tracks this health information so that any management tasks required of that failed Controller VM can be migrated to another Controller VM.

The Distributed Configuration Database module 806 also handles elections and consensus management within the system. Another task handled by the Distributed Configuration Database module is to implement ID creation. Unique IDs are generated by the Distributed Configuration Database module as needed for any required objects in the system, e.g., for vDisks, Controller VMs, extent groups, etc. In some embodiments, the IDs generated are 64-bit IDs, although any suitable type of IDs can be generated as appropriate for embodiments of the invention. According to some embodiments of the invention, the Distributed Configuration Database module 806 may be implemented on SSD storage because of the real-time guarantees required to monitor health events.
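One simple way to produce unique 64-bit IDs is a monotonically increasing counter guarded by a lock, as in the Python sketch below. This is an assumption made for illustration: the specification does not describe how the IDs are generated, and a distributed embodiment would need to coordinate the counter across nodes (for example through the consensus mechanism mentioned above). The class name `IdAllocator` is hypothetical.

```python
import threading


class IdAllocator:
    """Hands out unique IDs that fit in 64 bits (single-process sketch)."""

    MAX_ID = 2**64 - 1

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def allocate(self):
        # The lock makes allocation safe when several threads request IDs.
        with self._lock:
            if self._next > self.MAX_ID:
                raise OverflowError("64-bit ID space exhausted")
            value = self._next
            self._next += 1
            return value
```

The same allocator could serve vDisks, Controller VMs, and extent groups alike, since the IDs carry no type information.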

The vDisks can either be unshared (read and written by a single user VM) or shared (accessed by multiple user VMs or hypervisors) according to embodiments of the invention. FIG. 10 illustrates the shared vDisk scenario, in which a vDisk 923 can be accessed by multiple user VMs 902 a and 902 b on different server nodes 900 a and 900 b, respectively. In the example of FIG. 10, the shared vDisk 923 is owned by Controller VM 910 b on server node 900 b. Therefore, all I/O requests for vDisk 923 will be directed to this Controller VM 910 b using standard IP forwarding (Network Address Translation) rules in the networking stack of the Controller VMs.

For I/O requests 950 b from a user VM 902 b that resides on the same server node 900 b, the process to handle the I/O requests 950 b is straightforward, and is conducted as described above. Essentially, the I/O request is in the form of an iSCSI or NFS request that is directed to a given IP address. The IP address for the I/O request is common for all the Controller VMs on the different server nodes, but VLANs allow the IP address of the iSCSI or NFS request to be private to a particular (local) subnet, and hence the I/O request 950 b will be sent to the local Controller VM 910 b to handle the I/O request 950 b. Since local Controller VM 910 b recognizes that it is the owner of the vDisk 923 which is the subject of the I/O request 950 b, the local Controller VM 910 b will directly handle the I/O request 950 b.

Consider the situation if a user VM 902 a on a server node 900 a issues an I/O request 950 a for the shared vDisk 923, where the shared vDisk 923 is owned by a Controller VM 910 b on a different server node 900 b. Here, the I/O request 950 a is sent as described above from the user VM 902 a to its local Controller VM 910 a. However, the Controller VM 910 a will recognize that it is not the owner of the shared vDisk 923. Instead, the Controller VM 910 a will recognize that Controller VM 910 b is the owner of the shared vDisk 923. In this situation, the I/O request will be forwarded from Controller VM 910 a to Controller VM 910 b so that the owner (Controller VM 910 b) can handle the forwarded I/O request. To the extent a reply is needed, the reply would be sent to the Controller VM 910 a to be forwarded to the user VM 902 a that had originated the I/O request 950 a.

In some embodiments, an IP table 902 (e.g., a network address table or “NAT”) is maintained inside the Controller VM 910 a. The IP table 902 is maintained to include the address of the remote Server VMs. When the local Controller VM 910 a recognizes that the I/O request needs to be sent to another Controller VM 910 b, the IP table 902 is used to look up the address of the destination Controller VM 910 b. This “NATing” action is performed at the network layers of the OS stack at the Controller VM 910 a, when the local Controller VM 910 a decides to forward the IP packet to the destination Controller VM 910 b.
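The ownership check and forwarding path for a shared vDisk can be modeled with a short Python sketch. It simulates the behavior in a single process under stated assumptions: the class `ControllerVM`, the `owners` registry, the `network` dictionary standing in for the IP network, and the example addresses are all hypothetical, and actual forwarding would occur at the network layer rather than through a method call.

```python
class ControllerVM:
    """Handles I/O for vDisks it owns and forwards the rest (simulation)."""

    def __init__(self, name, address, owners, ip_table):
        self.name = name
        self.address = address
        self.owners = owners      # vDisk id -> name of the owning Controller VM
        self.ip_table = ip_table  # Controller VM name -> address
        self.handled = []         # requests this Controller VM served itself

    def handle_io(self, vdisk, request, network):
        owner = self.owners[vdisk]
        if owner == self.name:
            # Local owner: handle the request directly.
            self.handled.append(request)
            return f"{self.name} handled {request}"
        # Not the owner: look up the owner's address and forward the request.
        destination = self.ip_table[owner]
        reply = network[destination].handle_io(vdisk, request, network)
        # The reply travels back through this Controller VM to the user VM.
        return reply


# Example wiring: one shared vDisk owned by the second Controller VM.
owners = {"vdisk-shared": "cvm_b"}
ip_table = {"cvm_a": "10.0.0.1", "cvm_b": "10.0.0.2"}
cvm_a = ControllerVM("cvm_a", "10.0.0.1", owners, ip_table)
cvm_b = ControllerVM("cvm_b", "10.0.0.2", owners, ip_table)
network = {cvm_a.address: cvm_a, cvm_b.address: cvm_b}
```

A request issued to `cvm_a` for `"vdisk-shared"` is served by `cvm_b`, while a request issued to `cvm_b` is handled without any forwarding step.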

FIG. 11 shows an example of a “shared nothing” system, in which the vDisks 1023 a and 1023 b are un-shared vDisks. Therefore, each vDisk in the shared nothing system will be accessed by at most one user VM. Here, vDisk 1023 a is un-shared and is accessed only by user VM 1002 a on server node 1000 a. Similarly, vDisk 1023 b is un-shared and is accessed only by user VM 1002 b on server node 1000 b.

Each un-shared vDisk is owned by the Controller VM that is local to the user VM which accesses that vDisk on the shared-nothing basis. In the current example, vDisk 1023 a is owned by Controller VM 1010 a since this Controller VM is on the same server node 1000 a as the user VM 1002 a that accesses this vDisk. Similarly, vDisk 1023 b is owned by Controller VM 1010 b since this Controller VM is on the same server node 1000 b as the user VM 1002 b that accesses this vDisk.

I/O requests 1050 a that originate from user VM 1002 a would therefore be handled by its local Controller VM 1010 a on the same server node 1000 a. Similarly, I/O requests 1050 b that originate from user VM 1002 b would therefore be handled by its local Controller VM 1010 b on the same server node 1000 b. This is implemented using the same approach previously described above, in which the I/O request in the form of an iSCSI or NFS request is directed to a given IP address, and where VLANs allow the IP address of the iSCSI or NFS request to be private to a particular (local) subnet, so that the I/O request will be sent to the local Controller VM to handle the I/O request. Since the local Controller VM recognizes that it is the owner of the vDisk which is the subject of the I/O request, the local Controller VM will directly handle the I/O request.

It is possible that a user VM will move or migrate from one node to another node. Various virtualization vendors have implemented virtualization software that allows for such movement by user VMs. For shared vDisks, this situation does not necessarily affect the configuration of the storage system, since the I/O requests will be routed to the owner Controller VM of the shared vDisk regardless of the location of the user VM. However, for unshared vDisks, movement of the user VMs could present a problem since the I/O requests are handled by the local Controller VMs.

Therefore, what has been described is an improved architecture that enables significant convergence of the components of a system to implement virtualization. The infrastructure is VM-aware, and permits scale-out converged storage (SOCS) provisioning to allow storage on a per-VM basis, while identifying I/O coming from each VM. The current approach can scale out from a few nodes to a large number of nodes. In addition, the inventive approach has ground-up integration with all types of storage, including solid-state drives. The architecture of the invention provides high availability against any type of failure, including disk or node failures. In addition, the invention provides high performance by making I/O access local, leveraging solid-state drives and employing a series of patent-pending performance optimizations.

System Architecture

FIG. 12 is a block diagram of an illustrative computing system 1400 suitable for implementing an embodiment of the present invention. Computer system 1400 includes a bus 1406 or other communication mechanism for communicating information, which interconnects subsystems and devices, such as processor 1407, system memory 1408 (e.g., RAM), static storage device 1409 (e.g., ROM), disk drive 1410 (e.g., magnetic or optical), communication interface 1414 (e.g., modem or Ethernet card), display 1411 (e.g., CRT or LCD), input device 1412 (e.g., keyboard), and cursor control.

According to one embodiment of the invention, computer system 1400 performs specific operations by processor 1407 executing one or more sequences of one or more instructions contained in system memory 1408. Such instructions may be read into system memory 1408 from another computer readable/usable medium, such as static storage device 1409 or disk drive 1410. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and/or software. In one embodiment, the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the invention.

The term “computer readable medium” or “computer usable medium” as used herein refers to any medium that participates in providing instructions to processor 1407 for execution. Such a medium may take many forms, including but not limited to, non-volatile media and volatile media. Non-volatile media include, for example, optical or magnetic disks, such as disk drive 1410. Volatile media include dynamic memory, such as system memory 1408.

Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.

In an embodiment of the invention, execution of the sequences of instructions to practice the invention is performed by a single computer system 1400. According to other embodiments of the invention, two or more computer systems 1400 coupled by communication link 1415 (e.g., LAN, PSTN, or wireless network) may perform the sequence of instructions required to practice the invention in coordination with one another.

Computer system 1400 may transmit and receive messages, data, and instructions, including program, i.e., application code, through communication link 1415 and communication interface 1414. Received program code may be executed by processor 1407 as it is received, and/or stored in disk drive 1410, or other non-volatile storage for later execution.

In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. For example, the above-described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense.

What is claimed is:
1. A system, comprising: an appliance for implementing a virtualization system comprising compute components and storage components, the appliance having a plurality of serverboards, in which a respective serverboard corresponds to a single node of a plurality of nodes, wherein the single node comprises a hypervisor, one or more virtual machines, a controller virtual machine and one or more local storage devices, wherein the controller virtual machine runs as a virtual machine above the hypervisor and manages storage and I/O activities for the one or more virtual machines, wherein the appliance is rack mountable and the plurality of serverboards are inserted into the appliance, wherein one or more appliances operatively coupled to one another correspond to the virtualization system; a plurality of sets of storage units, wherein a set of storage units from the plurality of sets of storage units comprises a local storage device that is locally attached to a corresponding serverboard of a node, and the plurality of sets of storage units comprise a storage cluster that is structured from portions of local storage devices on separate nodes; and wherein controller virtual machines, from the plurality of nodes in the virtualization system, in communication with each other form a storage pool comprising the plurality of sets of storage units.
2. The system of claim 1, wherein the appliance has a height of 2 rack units or less.
3. The system of claim 1, wherein the appliance comprises enough redundancy such that it does not have a single point of failure.
4. The system of claim 1, wherein the appliance comprises enough redundancy such that it does not have a single point of performance bottleneck.
5. The system of claim 1, in which the appliance is interconnected with other appliances.
6. The system of claim 1, in which the sets of storage units comprise any combination of SSDs (Solid State Drives) and HDDs (Hard Disk Drives).
7. The system of claim 1, wherein iSCSI or NFS is used to communicate between the one or more user virtual machines and the controller virtual machine via a virtual switch in the hypervisor.
8. The system of claim 1, in which a set of storage units from the plurality of sets of storage units are implemented as a shared virtual disk in a virtualization environment.
9. The system of claim 8, further comprising: a plurality of nodes, wherein the plurality of nodes implements a virtualization environment, and a node comprises a hypervisor, user virtual machines, and a storage controller implemented as a controller virtual machine; a plurality of storage devices that are accessed by the user virtual machines and are managed by the storage controller, in which a virtual disk is formed from the plurality of storage devices and the virtual disk can be accessed by both a first user virtual machine and a second user virtual machine; the first user virtual machine is associated with a first storage controller and the second user virtual machine is associated with a second storage controller; the virtual disk is owned by the first storage controller; and I/O requests that are issued by second user virtual machine are forwarded by the second storage controller to the first storage controller.
10. The system of claim 1, further comprising: a plurality of nodes, wherein the plurality of nodes implements a virtualization environment, and a node comprises a hypervisor, user virtual machines, and a storage controller implemented as a controller virtual machine; a plurality of storage devices that are accessed by the user virtual machines and are managed by the storage controller, in which a first and a second virtual disk (vDisk) are formed from the plurality of storage devices and the first virtual disk can be exclusively accessed by a first user virtual machine and the second virtual disk can be exclusively accessed by a second user virtual machine; the first user virtual machine is associated with a first storage controller and the first virtual disk is owned by the first storage controller where first virtual disk I/O requests that are issued by the first user virtual machine are handled by the first storage controller; and the second user virtual machine is associated with a second storage controller and the second virtual disk is owned by the first storage controller where second virtual disk I/O requests that are issued by the second user virtual machine are handled by the second storage controller.
11. The system of claim 1, comprising computer readable storage medium having computer executable code for handling storage in response to migration of a virtual machine in a virtualization environment, comprising: receiving an I/O request for a virtual disk; determining that a controller virtual machine which implements a storage controller is not registered as owner of the virtual disk; obtaining ownership of the virtual disk; and handling the I/O request after obtaining ownership of the virtual disk.
12. The system of claim 1, comprising computer readable storage medium having computer executable code for handling storage in response to failure of a storage controller virtual machine in a virtualization environment, comprising: identifying a failure of a first storage controller virtual machine in a virtualization environment, wherein the storage controller virtualization machine is owner of a virtual disk; identifying a candidate replacement owner for the virtual disk, wherein the candidate replacement owner is a second storage controller virtual machine in the virtualization environment; transferring ownership of the virtual disk to the candidate replacement owner; and using the second storage controller virtual machine to handle I/O requests for the virtual disk.
13. The system of claim 1, comprising a controller virtual machine that runs as a virtual machine above a hypervisor in a node of the plurality of nodes.
14. The system of claim 1, comprising a controller virtual machine that manages a virtual disk that is exposed to a user virtual machine.
15. The system of claim 14 in which the virtual disk corresponds to one or more block devices or server targets.
16. The system of claim 1 comprising a controller virtual machine that operates as a storage controller that is dedicated to a single node.
17. The system of claim 16 in which a request for storage managed by a second service virtual machine are sent to be handled by the second service virtual machine.
18. The system of claim 16 in which a new node that is added to a system corresponds to a new controller virtual machine that acts as the storage controller for the new node.
19. The system of claim 1 comprising a controller virtual machine for each of the plurality of nodes correspond to a same IP address isolated by an internal network.
20. The system of claim 1 comprising a controller virtual machine that comprises distributed metadata service module to maintain metadata for virtual disks managed by the controller virtual machine.
21. The system of claim 1 comprising a controller virtual machine that comprises a health management module to maintain consistency of metadata for virtual disks managed by the controller virtual machine.
22. The system of claim 1 comprising a controller virtual machine that comprises a distributed configuration database module to maintain configuration data for the controller virtual machine.